Speculative forwarding in a high-radix router

ABSTRACT

A system and method for speculative forwarding of packets received by a router, wherein each packet includes phits and wherein one or more phits include a cyclic redundancy code (CRC). A packet is received and phits of the packet are forwarded to router logic. A cyclic redundancy code for the packet is calculated and compared to the packet's cyclic redundancy code. An error is generated if the cyclic redundancy codes don't match. If the cyclic redundancy codes don't match, a phit of the packet is modified to reflect the error, the CRC is corrected and the corrected CRC is forwarded to the router logic along with the phit reflecting the CRC error. At the router logic, a check is made to see if the packet is still within the router logic. If the packet is still within the router logic and there was a CRC error, the packet is discarded. If, however, the packet is no longer within the router logic and there was a CRC error, the packet is modified so that the next router discards the packet.

RELATED APPLICATIONS

This application claims the priority benefit of U.S. Provisional Application Ser. No. 60/925,470 filed Apr. 20, 2007, the contents of which is incorporated herein by reference in its entirety.

This application is related to U.S. patent application Ser. No. ______, entitled “HIGH-RADIX INTERPROCESSOR COMMUNICATIONS SYSTEM AND METHOD”, filed on even date herewith (Atty. Docket No. 1376.770US1); to U.S. patent application Ser. No. ______, entitled “FLEXIBLE ROUTING TABLES FOR A HIGH-RADIX ROUTER”, filed on even date herewith (Atty. Docket No. 1376.766US1); and to U.S. patent application Ser. No. ______, entitled “LOAD BALANCING FOR COMMUNICATIONS WITHIN A MULTIPROCESSOR COMPUTER SYSTEM”, filed on even date herewith (Atty. Docket No. 13176.767US1); each of which is incorporated herein by reference in its entirety.

FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT

The U.S. Government has a paid-up license in this invention and the right in limited circumstances to require the patent owner to license others on reasonable terms as provided for by the terms of contract No. MDA904-02-3-0052, awarded by the Maryland Procurement Office.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention is related to multiprocessor computer systems, and more particularly to a system and method for routing packets in a multiprocessor computer system.

2. Background Information

The interconnection network plays a critical role in the cost and performance of a scalable multiprocessor. It determines the point-to-point and global bandwidth of the system, as well as the latency for remote communication. Latency is particularly important for shared-memory multiprocessors, in which memory access and synchronization latencies can significantly impact application scalability, and is becoming a greater concern as system sizes grow and clock cycles shrink.

It is common practice to protect a packet received at a router from an incoming link with a cyclic redundancy checksum (CRC). The CRC ensures reliable delivery of the packet over the link. Checking the CRC takes time; in order to guarantee that the packet is correct, the packet is delayed until the CRC is checked before the packet is allowed to proceed through the router and out the next link in the path through the network. This store-and-forward approach adds latency to every hop. This latency is higher in high-radix routers, which have narrower links and thus higher latency to receive a given packet over a single link.

There are two existing approaches to reducing this store-and-forward latency. The first is to break packets up into smaller pieces (micropackets), each protected by its own CRC. The downside to this is that it increases the CRC overhead (relative number of transmitted bits spent on CRCs). It also only partially solves the problem, as there is still store-and-forward latency of the micropackets.

The other solution is to not check packets for errors on each hop. This requires either giving up on reliable transmission at high speed, or else using an end-to-end reliable delivery protocol. End-to-end protocols, however, are very complex and require O(N^2) state for N nodes, so are not scalable to large system sizes. They also can have performance issues in large systems, as the probability of an error on an end-to-end path is much higher than the probability of an error on a single link.

What is needed is a system and method for reducing packet-forwarding latency in a router, while maintaining reliable communication.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a computer system with a high-radix routing system;

FIG. 2 illustrates the computer system of FIG. 1 with uplinks to higher rank routers;

FIGS. 3(a)-(c) illustrate network topologies for computer systems;

FIG. 4 illustrates one embodiment of a router for systems of FIGS. 1-3;

FIG. 5 illustrates latency in transfer of a packet through the router of FIG. 4;

FIG. 6 illustrates one example embodiment of a packet format that can be used in the router of FIG. 4; and

FIG. 7 illustrates packet traversal through the router of FIG. 4.

DETAILED DESCRIPTION OF THE INVENTION

In the following detailed description of the preferred embodiments, reference is made to the accompanying drawings which form a part hereof, and in which is shown by way of illustration specific embodiments in which the invention may be practiced. It is to be understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the present invention.

A computer system is shown in FIG. 1. In the computer system 100 of FIG. 1, processor nodes 102.1 through 102.N are connected by links 104 to routers 106. In the embodiment shown, each processor node 102 includes four injection ports, wherein each injection port is connected to a different router 106. In addition, each processor node 102 includes local memory and one or more processors. Each router 106 is a high-radix router as will be described below.

In one embodiment, computer system 100 is designed to run demanding applications with high communication requirements. It is a distributed shared memory multiprocessor built with high performance, high bandwidth custom processors. The processors support latency hiding, addressing and synchronization features that facilitate scaling to large system sizes.

It provides a globally shared memory with direct global load/store access. In one such embodiment, system 100 is globally cache coherent, but each processor only caches data from memory 112 within its four-processor node 102. This provides natural support for SMP applications on a single node, and hierarchical (e.g., shmem or MPI on top of OpenMP) applications across the entire machine. Pure distributed memory applications (MPI, shmem, CAF, UPC) are supported as well.

In one such embodiment, each processor is implemented on a single chip and includes a 4-way-dispatch scalar core, 8 vector pipes, two levels of cache and a set of ports to the local memory system. Each processor in system 100 can support thousands of outstanding global memory references.

For such embodiments, the network should be designed to provide very high global bandwidth, while also providing low latency for efficient synchronization and scalability. To accomplish this, in one embodiment, routers 106 are interconnected in a high-radix folded Clos or fat-tree topology with sidelinks. By providing sidelinks, one can statically partition the global network bandwidth among the peer subtrees, reducing the cost and the latency of the network.

In the embodiment shown in FIG. 2, computer system 120 uses high-radix routers 106, each of which has 64 ports that are three bits wide in each direction. In the embodiment shown, each processor node 102 has four injection ports into the network, with each port connecting to a different network slice. Each slice is a completely separate network with its own set of routers 106. The following discussion will focus on a single slice of the network.

By using a high-radix router with many narrow channels we are able to take advantage of the higher pin density and faster signaling rates available in modern ASIC technology. In one embodiment, router 106 is an 800 MHz ASIC with sixty-four 18.75 Gb/s bidirectional ports for an aggregate off-chip bandwidth of 2.4 Tb/s. Each port consists of three 6.25 Gb/s differential signals in each direction. The router supports deterministic and adaptive packet routing with separate buffering for request and reply virtual channels. The router is organized hierarchically as an 8×8 array of tiles, which simplifies arbitration by avoiding long wires in the arbiters. Each tile of the array contains a router port, its associated buffering, and an 8×8 router subswitch.
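The per-port and aggregate bandwidth figures quoted above follow from the lane count and lane rate. The short Python check below is offered only as an illustration of that arithmetic; the variable names are not part of the described embodiment.

```python
# Illustrative check of the bandwidth figures quoted in the text.
LANES_PER_PORT = 3        # differential signals per port, per direction
LANE_RATE_GBPS = 6.25     # signaling rate of each differential signal
PORTS = 64                # ports on one router

# One direction of one port: three lanes at 6.25 Gb/s each.
port_rate_gbps = LANES_PER_PORT * LANE_RATE_GBPS

# Aggregate off-chip bandwidth counts both directions of all ports.
aggregate_tbps = PORTS * port_rate_gbps * 2 / 1000

print(port_rate_gbps)   # 18.75
print(aggregate_tbps)   # 2.4
```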

In one embodiment, computer system 120 scales up to 32K processors using a variation on a folded-Clos or fat-tree network topology that can be incrementally scaled. In one such embodiment, computer system 120 is packaged in modules, chassis, and cabinets. Each compute module contains eight processors with four network ports each.

In one embodiment, a chassis holds eight compute modules organized as two 32-processor rank 1 (R1) subtrees, and up to four R1 router modules (each of which provides two network slices for one of the subtrees). Each R1 router module contains two 64-port YARC router chips (see FIG. 2) providing 64 downlinks that are routed to the processor ports via a mid-plane, and 64 uplinks (or sidelinks) that are routed to eight 96-pin cable connectors that carry eight links each. (“YARC” stands for “Yet Another Routing Chip.”)

In one such embodiment, each cabinet holds two chassis (128 processors) organized as four 32-processor R1 subtrees. Machines with up to 288 processors, nine R1 subtrees, can be connected by directly cabling the R1 subtrees to one another using sidelinks 108 as shown in FIGS. 3(a) and (b) to create a rank 1.5 (R1.5) network.

To scale beyond 288 processors, uplink cables 110 from each R1 subtree are connected to rank 2 (R2) routers 112. A rank 2/3 router module (FIG. 3(c)) packages four routers 106 on an R2/R3 module.

In one embodiment, the four radix-64 routers 106 on the R2/R3 module are each split into two radix-32 virtual routers. Logically, each R2/R3 module has eight radix-32 routers providing 256 network links on 32 cable connectors. Up to 16 R2/R3 router modules are packaged into a stand-alone router cabinet.

Machines of up to 1024 processors can be constructed by connecting up to thirty-two 32-processor R1 subtrees to R2 routers. Machines of up to 4.5K processors can be constructed by connecting up to nine 512-processor R2 subtrees via side links 108. Up to 16K processors may be connected by a rank 3 (R3) network where up to thirty-two 512-processor R2 subtrees are connected by R3 routers. Networks having up to 72K processors could be constructed by connecting nine R3 subtrees via side links 108.

The above topology and packaging scheme enables very flexible provisioning of network bandwidth. For instance, by only using a single rank 1 router module (instead of two as shown in FIG. 2), the port bandwidth of each processor is reduced by half, halving both the cost of the network and its global bandwidth. An additional bandwidth taper can be achieved by connecting only a subset of the rank 1 to rank 2 network cables, reducing cabling cost and R2 router cost at the expense of the bandwidth taper.

The Router

The input-queued crossbar organization often used in low-radix routers does not scale efficiently to high radices because the arbitration logic and wiring complexity both grow quadratically with the number of inputs. To overcome this complexity, in one embodiment, router 106 is organized using a hierarchical organization in a manner similar to that proposed by Kim et al. above.

As shown in FIG. 4, in one embodiment router 106 is organized as an 8×8 array of tiles 200 within a single YARC chip 201. Each tile 200 contains all of the logic and buffering associated with one input port 190 and one output port 192. Each tile 200 also contains an 8×8 switch 202 and associated buffers (212, 214). Each tile's switch 202 accepts inputs from eight row buses 204 that are driven by the input ports 190 in its row, and drives separate output channels 206 to the eight output ports 192 in its column. Using a tile-based microarchitecture facilitates implementation, since each tile is identical and produces a very regular structure for replication and physical implementation in silicon.

In one embodiment, computer systems 100 and 120 use two virtual channels (VCs), designated request (v=0) and response (v=1) to avoid request-response deadlocks in the network. Therefore, all buffer resources are allocated according to the virtual channel bit in the head phit. Each input buffer is 256 phits and is sized to cover the round-trip latency across the network channel. Virtual cut-through flow control is used across the network links. In one such embodiment, each VC drives its own row bus 204. This provides some row bus speedup since request and response flits can flow onto row busses simultaneously. It also eliminates the need for arbitration for the row busses 204.

The router 106 microarchitecture is best understood by following a packet through the router. A packet (such as packet 300 shown in FIG. 6) arrives in the input buffer 210 of a tile 200 (fed from the incoming link control block (LCB)).

When the packet reaches the head of the buffer a routing decision is made at route selector 218 to select the output column 208 for the packet. The packet is then driven onto the row bus 204 associated with the input port 190 and buffered in a row buffer 212 at the input of the 8×8 switch 202 at the junction of the packet's input row and output column. At this point the routing decision must be refined to select a particular output port 192 within the output column 208. The switch 202 then routes the packet to the column channel 206 associated with the selected output port 192. The column channel delivers the packet to an output buffer 214 (associated with the input row) at the output port multiplexer 216. Packets in the per-input-row output buffers 214 arbitrate for access to the output port 192 and, when granted access, are switched onto output port 192 via the multiplexer 216.

In the embodiment shown in FIG. 4, router 106 includes three types of buffers: input buffers 210, row buffers 212, and column buffers 214. Each buffer is partitioned into two virtual channels. One input buffer 210 and 8 row buffers 212 are associated with each input port 190. Thus, no arbitration is needed to allocate these buffers, only flow control. Eight column buffers 214 are associated with each subswitch 202. Allocation of the column buffers 214 takes place at the same time the packet is switched.

Output arbitration is performed in two stages. The first stage of arbitration is done to gain access to the output of the subswitch 202. A packet then competes with packets from other tiles 200 in the same column 208 in the second stage of arbitration for access to the output port 192. Unlike the hierarchical crossbar in Kim, however, router 106 takes advantage of the abundant on-chip wiring resources to run separate channels 206 from each output of each subswitch 202 to the corresponding output port 192. This organization places the column buffers 214 in the output tiles 200 rather than at the output of the subswitches 202. Co-locating the eight column buffers 214 associated with a given output in a single tile 200 simplifies global output arbitration. With column buffers 214 at the outputs of the subswitch 202, the requests/grants to/from the global arbiters would need to be pipelined to account for wire delay, which would complicate the arbitration logic.

In one embodiment of the router 106 of FIG. 4, a packet traversing router 106 passes through 25 pipeline stages, resulting in a zero-load latency of 31.25 ns. A pipeline diagram illustrating passage through such a router 106 is shown in FIG. 5. In one embodiment, each major block (input queue (210, 212), subswitch 202, and column buffers 214) is designed with both input and output registers. This approach simplified system timing and design at the expense of latency. During the design, additional pipeline stages were inserted to pipeline the wire delay associated with the row busses and the column channels.
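The zero-load latency figure is consistent with 25 stages clocked at the 800 MHz rate given earlier for router 106. The check below assumes one pipeline stage per clock cycle, which the text implies but does not state outright.

```python
# Illustrative check: 25 pipeline stages at an 800 MHz clock.
# Assumption: each pipeline stage takes exactly one clock cycle.
CLOCK_MHZ = 800
PIPELINE_STAGES = 25

cycle_ns = 1000 / CLOCK_MHZ                       # 1.25 ns per cycle
zero_load_latency_ns = PIPELINE_STAGES * cycle_ns

print(zero_load_latency_ns)   # 31.25
```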

The Communication Stack

The communication stack in computer systems 100 and 120 can be considered as three layers: network layer, data-link layer, and physical layer. We discuss the packet format, flow control across the network links, the link control block (LCB) which implements the data-link layer, and the serializer/deserializer (SerDes) at the physical layer.

One embodiment of a packet that can be used in computer systems 100 and 120 is shown in FIG. 6. In one embodiment, packets are divided into 24-bit phits for transmission over internal datapaths. These phits are further serialized for transmission over 3-bit wide network channels. A minimum packet contains 4 phits carrying 32 payload bits.

Longer packets are constructed by inserting additional payload phits (like the third phit in the figure) before the tail phit. Two bits of each phit, as well as all of the tail phit, are used by the data-link layer.
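As an illustration of the serialization step, a 24-bit phit sent over a 3-bit wide channel occupies eight channel transfers. The sketch below is a software model only; the function names are hypothetical, and the most-significant-first ordering is an assumption, since the text does not specify the order in which bits are placed on the channel.

```python
PHIT_BITS = 24
CHANNEL_BITS = 3
SYMBOLS_PER_PHIT = PHIT_BITS // CHANNEL_BITS   # 8 transfers per phit


def serialize_phit(phit):
    """Split a 24-bit phit into eight 3-bit symbols.

    Assumed ordering: most significant symbol first.
    """
    if not 0 <= phit < (1 << PHIT_BITS):
        raise ValueError("phit must fit in 24 bits")
    mask = (1 << CHANNEL_BITS) - 1
    return [(phit >> (CHANNEL_BITS * i)) & mask
            for i in reversed(range(SYMBOLS_PER_PHIT))]


def deserialize_phit(symbols):
    """Reassemble a phit from its 3-bit symbols (same assumed ordering)."""
    phit = 0
    for symbol in symbols:
        phit = (phit << CHANNEL_BITS) | symbol
    return phit


symbols = serialize_phit(0xABCDEF)
print(len(symbols))                      # 8
print(hex(deserialize_phit(symbols)))    # 0xabcdef
```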

The head phit of the packet controls routing. In addition to specifying the destination, this phit contains a v bit that specifies which virtual channel to use, and three bits, h, a, and r, that control routing.

If the r bit is set, the packet will employ source routing. That is, the packet header will be accompanied by a routing vector that indicates the path through the network as a list of ports used to select the output port 192 at each hop. Source routed packets are normally used only for maintenance operations such as reading and writing configuration registers on router 106.

If the a bit is set, the packet will route adaptively, otherwise it will route deterministically.

If the h bit is set, the deterministic routing algorithm employs the hash bits in the second phit to select the output port 192.
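The effect of the v, r, a, and h bits can be summarized as a small decision function. The sketch below is a model under stated assumptions: the `HeadPhit` container and the mode names are hypothetical, and giving the r bit precedence over the a bit is an assumption, since the text does not say how the bits interact when more than one is set.

```python
from dataclasses import dataclass


@dataclass
class HeadPhit:
    """Hypothetical container for the routing control bits of a head phit."""
    v: int  # virtual channel: 0 = request, 1 = response
    r: int  # source routing
    a: int  # adaptive routing
    h: int  # use hash bits for deterministic routing


def routing_mode(head):
    # Assumed precedence: source routing first, then adaptive, then deterministic.
    if head.r:
        return "source"
    if head.a:
        return "adaptive"
    return "deterministic-hashed" if head.h else "deterministic"


def virtual_channel(head):
    return "response" if head.v else "request"


print(routing_mode(HeadPhit(v=0, r=0, a=0, h=1)))     # deterministic-hashed
print(virtual_channel(HeadPhit(v=1, r=0, a=1, h=0)))  # response
```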

Network flow control will be discussed next. The allocation unit for flow control is a 24-bit phit; thus, the phit is really the flit (flow control unit). In one embodiment, as noted above, computer systems 100 and 120 use two virtual channels (VCs), designated request (v=0) and response (v=1) to avoid request-response deadlocks in the network. Therefore, all buffer resources are allocated according to the virtual channel bit in the head phit. Each input buffer is 256 phits and is sized to cover the round-trip latency across the network channel. Virtual cut-through flow control is used across the network links. In one embodiment, each VC drives its own row bus 204.

The data-link layer will be discussed next. In one embodiment, the data-link layer protocol is implemented by the link control block. The LCB receives phits from router 106 and injects them into the serializer logic where they are transmitted over the physical medium. The incoming LCB feeds directly to the input buffers 210.

The primary function of the LCB is to reliably transmit packets over the network links using a sliding window go-back-N protocol. The send buffer storage and retry is on a packet granularity. The link control block is described in greater detail in “Inter-ASIC Data Transport Using Link Control Block Manager,” U.S. patent application Ser. No. 11/780,258, filed Jul. 19, 2007, the description of which is incorporated by reference.

In the embodiment shown in FIG. 6, the 24-bit phit uses 2 bits of sideband dedicated as a control channel for the LCB to carry sequence numbers and status information. The virtual channel acknowledgment status bits travel in the LCB sideband. These VC acks are used to increment the per-VC credit counters in the output port logic. The ok field in the EOP phit indicates if the packet is healthy, encountered a transmission error on the current link (transmit_error), or was corrupted prior to transmission (soft_error).

If the LCB receives a packet with a CRC error, then corruption has just occurred while traversing the incoming link. The LCB enters an error recovery mode, and, assuming that the error was transient, a good version of the packet will eventually be received and handed up to the router core. In the meantime, however, the LCB has likely started to forward the corrupt packet up to the router core. To handle this, when a CRC error is detected, the LCB sets the status code in the tail phit to PACKET_BAD_WILLRETRY and recomputes the CRC before handing the tail phit up to the router core. This tells the router core logic that the packet is going to be retransmitted, and should be discarded if possible. The higher level flow control that manages the space in the router core's input buffer should not acknowledge receipt and consumption of this packet, because we cannot trust any of the packet contents, including the virtual channel number.

If the corrupted packet cannot be discarded before it begins transmitting over the next link in the network, then the status in its last phit is set to PACKET_BAD by the output port's LCB before transmitting it. Thereafter, assuming no further transmission errors, the packet will flow across the network marked as a bad packet (but with a good CRC), and will be discarded at the destination, as discussed above.
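The marking performed by the receiving and transmitting LCBs can be sketched as follows. This is a simplified software model, not the hardware described: the function names are hypothetical, the packet is reduced to a byte string plus a status code, and `zlib.crc32` stands in for the link CRC, whose polynomial and coverage are not given in the text.

```python
import zlib
from enum import Enum


class Status(Enum):
    PACKET_OK = 0
    PACKET_BAD_WILLRETRY = 1
    PACKET_BAD = 2


def crc_of(payload, status):
    # Stand-in CRC over payload and status code (assumed coverage).
    return zlib.crc32(payload + bytes([status.value]))


def lcb_receive_tail(payload, status, crc):
    """Receiving LCB: on a CRC mismatch, re-mark the tail and regenerate a good CRC."""
    if crc_of(payload, status) != crc:
        status = Status.PACKET_BAD_WILLRETRY
        crc = crc_of(payload, status)
    return status, crc


def lcb_transmit_tail(payload, status):
    """Transmitting LCB: a bad packet that could not be discarded leaves as PACKET_BAD."""
    if status is Status.PACKET_BAD_WILLRETRY:
        status = Status.PACKET_BAD
    return status, crc_of(payload, status)


good_crc = crc_of(b"hello", Status.PACKET_OK)
# One corrupted byte on the link: the CRC no longer matches.
status, crc = lcb_receive_tail(b"hellp", Status.PACKET_OK, good_crc)
print(status.name)                                     # PACKET_BAD_WILLRETRY
print(lcb_transmit_tail(b"hellp", status)[0].name)     # PACKET_BAD
```

In this model the packet that escapes to the next link carries a CRC that matches its (corrupt) contents, so downstream routers do not treat it as a fresh transmission error.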

The physical layer will be discussed next. The serializer/deserializer (SerDes) implements the physical layer of the communication stack. In one embodiment, router 106 instantiates a high-speed SerDes in which each lane consists of two complementary signals making a balanced differential pair.

In one embodiment, the SerDes is organized as a macro which replicates multiple lanes. For full duplex operation, an 8-lane receiver and an 8-lane transmitter macro are instantiated. In one such embodiment, router 106 instantiates forty-eight (48) 8-lane SerDes macros, twenty-four (24) 8-lane transmit and twenty-four (24) 8-lane receive macros, consuming approximately of the available silicon in a full ASIC implementation of router 106.

In one embodiment, the SerDes supports two full-speed data rates: 5 Gbps or 6.25 Gbps. Each SerDes macro is capable of supporting full, half, and quarter data rates using clock dividers in the PLL module. This allows the following supported data rates: 6.25, 5.0, 3.125, 2.5, 1.5625, and 1.25 Gbps. This should be adequate to drive a 6 meter, 26 gauge cable at the full data rate of 6.25 Gbps, allowing for adequate printed circuit board (PCB) foil at both ends.
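The list of supported data rates is simply each full-speed rate divided by 1, 2, and 4. The snippet below reproduces the list as a check; the names are illustrative.

```python
# Each full-speed rate is available at full, half, and quarter rate.
FULL_RATES_GBPS = (6.25, 5.0)
DIVIDERS = (1, 2, 4)

supported = sorted((rate / div for rate in FULL_RATES_GBPS for div in DIVIDERS),
                   reverse=True)
print(supported)   # [6.25, 5.0, 3.125, 2.5, 1.5625, 1.25]
```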

In one such embodiment, shown in FIG. 7, each port on router 106 is three bits wide, for a total of 384 low voltage differential signals coming off each router 106 (192 transmit and 192 receive). Since the SerDes macro 702 is 8 lanes wide and each router port is only 3 lanes wide, a naive assignment of tiles to SerDes would have 2 and ⅔ ports (8 lanes) for each SerDes macro. Consequently, in such an embodiment it can be useful to aggregate three SerDes macros (24 lanes) to share across eight YARC tiles (also 24 lanes). This grouping of eight tiles is called an octant (tiles belonging to the same octant are shown in FIG. 7) and, in one embodiment, imposes the constraint that each octant must operate at the same data rate.
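The lane counts behind the octant grouping can be verified with a few lines of arithmetic. The names below are illustrative only.

```python
from fractions import Fraction

PORTS = 64
LANES_PER_PORT = 3
LANES_PER_MACRO = 8
TILES_PER_OCTANT = 8

lanes_per_direction = PORTS * LANES_PER_PORT          # 192 transmit, 192 receive
total_signals = 2 * lanes_per_direction               # 384 differential signals

# A naive one-macro mapping covers a non-integral number of ports.
ports_per_macro = Fraction(LANES_PER_MACRO, LANES_PER_PORT)   # 8/3 = 2 and 2/3

# Eight tiles need 24 lanes, which is exactly three 8-lane macros.
macros_per_octant = TILES_PER_OCTANT * LANES_PER_PORT // LANES_PER_MACRO

macros_per_direction = lanes_per_direction // LANES_PER_MACRO  # 24
octants = PORTS // TILES_PER_OCTANT                            # 8

print(total_signals, ports_per_macro, macros_per_octant, macros_per_direction, octants)
# 384 8/3 3 24 8
```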

In one embodiment, the SerDes has a 16/20 bit parallel interface which is managed by the link control block (LCB). In one embodiment, the positive and negative components of each differential signal pair can be arbitrarily swapped between the transmit/receive pair. In addition, each of the 3 lanes which comprise the LCB port can be permuted or “swizzled.” The LCB determines which are the positive and negative differential pairs during channel initialization, as well as which lanes are “swizzled”. This degree of freedom simplifies the board-level river routing of the channels and reduces the number of metal layers on a PCB for the router module.

Speculative Forwarding

As noted above, CRC is used to detect soft errors in the pipeline data paths and static memories used for storage. As noted above, the narrow links of a high-radix router cause a higher serialization latency to squeeze the packet over a link. For example, a 32B cache-line write results in a packet with 19 phits (6 header, 12 data, and 1 EOP). Consequently, the LCB passes phits up to the higher-level logic speculatively, prior to verifying the packet CRC, which avoids store-and-forward serialization latency at each hop. However, this early forwarding complicates various error conditions in order to correctly handle a packet with a transmission error and reclaim the space in the input queue at the receiver.
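To give a sense of scale, the serialization time of the 19-phit packet in the example can be estimated from the port rate given earlier. This is a rough estimate under a stated assumption: each 24-bit phit is taken to occupy exactly 24 bit times on the link, so any line-coding overhead in the SerDes is ignored.

```python
# Rough estimate of the per-hop delay that store-and-forward would add
# for the 19-phit cache-line write packet.
# Assumption: no line-coding overhead; 24 bit times per phit.
PHIT_BITS = 24
PHITS_IN_PACKET = 19              # 6 header + 12 data + 1 EOP
PORT_RATE_GBPS = 3 * 6.25         # three 6.25 Gb/s lanes per direction

packet_bits = PHITS_IN_PACKET * PHIT_BITS
serialization_ns = packet_bits / PORT_RATE_GBPS   # bits / (Gb/s) gives ns

print(packet_bits)                   # 456
print(round(serialization_ns, 2))    # 24.32
```

Under this assumption, waiting for the whole packet before forwarding would add roughly 24 ns at every hop, which is comparable to the 31.25 ns zero-load latency of the router pipeline itself.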

In one embodiment, we avoid a store-and-forward delay of a packet due to CRC checking at the router input port through speculative forwarding. In such an embodiment, individual phits (physical transfer units, each containing 24 bits of packet data) are forwarded into the router core as they are received, before the packet's CRC is checked. The head of the packet may have already flowed out a router exit port and across a link by the time the packet CRC is checked. The error handling protocols and buffer management are designed to deal with the case that a CRC error is detected.

In one embodiment, before transmitting a tail phit onto the network link, the LCB checks the current CRC against the packet contents to determine if a soft_error has corrupted the packet. If the packet is corrupted, it is marked as soft_error, and a good CRC is generated so that it is not detected by the receiver as a transmission error. The packet will continue to flow through the network marked as a bad packet with a soft error and eventually be discarded by the network interface at the destination processor.

Because a packet with a transmission error is speculatively passed up to the router core and may have already flowed to the next router by the time the tail phit is processed, the LCB and input queue must prevent corrupting the router state.

The speculative forwarding mechanism must, therefore, take into account the possibility that a corruption could create a max-sized packet (the LCB will never allow a larger-than max-sized packet to be created) with an incorrect virtual channel to be handed up to the router core. The tricky part of the whole mechanism is making sure that the router core's flow control for the input buffer space is not corrupted, and that the input buffer never overflows.

In one embodiment, the LCB detects packet CRC errors and marks the packet as transmit_error with a corrected CRC before handing the end-of-packet (EOP) phit up to the router core. The LCB also monitors the packet length of the received data stream and clips any packets that exceed the maximum packet length, which is programmed into an LCB configuration register. When a packet is clipped, an EOP phit is appended to the truncated packet and it is marked as transmit_error. In one embodiment, the LCB will enter error recovery mode on either error and await the retransmission.

The input queue in the router must be protected from overflow. If it receives more phits than can be stored, the input queue logic will adjust the tail pointer to excise the bad packet and discard further phits from the LCB until the EOP phit is received. If a packet marked transmit_error is received at the input buffer, we want to drop the packet and avoid sending any virtual channel acknowledgments. The sender will eventually time out and retransmit the packet. If the bad packet has not yet flowed out of the input buffer, it can simply be removed by setting the tail pointer of the queue to the tail of the previous packet. Otherwise, if the packet has flowed out of the input buffer, we let the packet go and decrement the number of virtual channel acknowledgments to send by the size of the bad packet. The transmit-side router core does not need to know anything about recovering from bad packets. All effects of the error are contained within the LCB and router input queueing logic.

In one embodiment, the link control block (LCB) modifies phits of a received packet before sending the modified phit up to the router core. In one such embodiment, the last phit of a packet, which contains the CRC, also contains a status code indicating whether the packet is:

good (PACKET_OK);

corrupted, and will be retransmitted (PACKET_BAD_WILLRETRY); or

corrupted, but will not be re-transmitted (PACKET_BAD).

At the router core, packets that are received with good CRCs will either have a status of PACKET_OK or PACKET_BAD. In either event, they are routed as healthy packets through the network. At the destination, the packets are fully received before being presented to the compute node, and any packet with a status of PACKET_BAD is dropped at that time.

Within the router, data is used before it is verified by the EOP CRC. Due to this, special care must be taken to make sure that channel errors are managed by the router. Consider the implications within the router of a single bit error in one of the channel control fields. If the payload bit is flipped, it could either create an EOP/idle phit where it doesn't belong (1->0) or cause one to be missed (0->1).

If an idle phit is created where it doesn't belong, it will be ignored and the CRC will fail at the end of the packet. If an EOP phit is created where it doesn't belong, the CRC will be found bad immediately. In either case, a marked bad packet will be sent through the router, and all following data will be discarded by the LCB until reframing has occurred. This scenario doesn't cause potential buffer overflows.

A larger problem is created if the EOP is missed. This can create “super-packets,” where two consecutive packets look like they've been merged into one. This has the potential to overflow the input VC buffers in the router. Also, a bad vc bit in the head phit can cause overflow in the input VC buffers.

The bottom line is that any error on the channel can result in a bad packet being handed up to the router. To deal with this, in one embodiment, the LCB monitors packet length of the receive data stream and clips any packets that exceed the maximum packet size (max_pkt_size) by interpreting the phit at that position as a tail, regardless of its encoding. Such an approach should result in a bad CRC. After this occurs, the LCB will be in error recovery mode, and will ignore all incoming data until a re-transmission sequence is received.
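The clipping behavior can be modeled with a small framing function. This is a sketch under simplifying assumptions: the function name `lcb_frame` and the (data, is_tail) phit representation are hypothetical, CRC checking is omitted, and error recovery is modeled simply by ignoring the rest of the stream after a clip.

```python
def lcb_frame(phits, max_pkt_size):
    """Group a phit stream into packets, clipping at max_pkt_size.

    phits: iterable of (data, is_tail) pairs (hypothetical representation).
    Returns a list of (packet_phits, clipped) pairs. After a clip the model
    stops, standing in for the LCB ignoring data until retransmission.
    """
    packets = []
    current = []
    for data, is_tail in phits:
        current.append(data)
        if is_tail:
            packets.append((current, False))
            current = []
        elif len(current) == max_pkt_size:
            # Treat this phit as the tail regardless of its encoding.
            packets.append((current, True))
            break
    return packets


# Two 3-phit packets whose first EOP marking was lost form a "super-packet".
stream = [(0, False), (1, False), (2, False), (3, False), (4, False), (5, True)]
print(lcb_frame(stream, max_pkt_size=4))   # [([0, 1, 2, 3], True)]
```

The clipped packet is never longer than max_pkt_size, which is what allows the input buffer to be sized with a bounded amount of slack.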

At the same time, the input buffers in the router protect themselves against overflow. If they receive more phits than can be stored, the input buffer logic will adjust the queue tail pointer to remove the bad packet, and discard further phits from the LCB until an EOP is received.

In one embodiment, the LCB retry protocol begins on a packet boundary. The receiver logic keeps track of the last successfully received packet, and if the sender starts re-transmitting with an earlier packet, the receiver throws away packets until receiving the first packet not previously received correctly.
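The receiver side of the retry protocol can be sketched as a filter on packet sequence numbers. The class below is a hypothetical model: it assumes each packet carries an unbounded integer sequence number (wraparound of a real, finite sequence field is ignored) and that the caller already knows whether the packet arrived intact.

```python
class RetryReceiver:
    """Receiver-side filter for a go-back-N style retry (illustrative model)."""

    def __init__(self):
        self.next_expected = 0   # sequence number of the first packet not yet received

    def accept(self, seq, ok):
        """Return True if the packet should be handed up as newly received."""
        if not ok:
            return False         # bad packet: wait for the retransmission
        if seq != self.next_expected:
            return False         # already received, or sent after a lost packet
        self.next_expected += 1
        return True


rx = RetryReceiver()
results = [rx.accept(0, True), rx.accept(1, True), rx.accept(2, False)]
# The sender retries starting from packet 1, which was already received.
results += [rx.accept(1, True), rx.accept(2, True), rx.accept(3, True)]
print(results)   # [True, True, False, False, True, True]
```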

After a channel error from which the LCB successfully recovers, the vc credits could be out of sync with each other because a bad packet has landed in one of the vc input buffers. It may have landed in the wrong input queue due to an error in the vc bit, or it may be in the right queue. It also may possibly have grown or shrunk, if a tail bit was flipped. It doesn't really matter. The bad packet itself is either consuming input buffer space it shouldn't, or, if flowed out of the input buffer, has generated acks that it shouldn't have. If the channel recovers, the good packets will eventually be transmitted successfully. Since the higher level logic that manages vc credits is unaware of the buffer space being consumed by the bad packet (or else it received too many acks), the credits need to be adjusted when this occurs. The strategy for this is as follows:

1) The input fifo will be oversized by max_pkt_size;

2) Packet flow out of the input buffer will be virtual cut through (otherwise, the fifo would have to be oversized by two times max_pkt_size);

3) The input fifo enqueueing logic will:

   a) Count phits as it enqueues packets;

   b) Clip any packet that would otherwise overflow (actually remove the offending packet);

   c) If a bad packet is detected (either marked bad, or forced bad due to clipping) and the packet has not yet flowed out of the queue, it will be removed by setting the fifo tail pointer to the tail of the previous packet. Otherwise, the credits-to-send counter for the VC receiving the bad packet will be decremented by the size of the bad packet.

The other end does not need to know anything about the bad packet. All effects of the error are contained within the input queueing logic of router 106.
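The enqueueing strategy listed above can be sketched in software. This is a simplified model under stated assumptions: a packet is presented as a whole list of phits, the caller says whether it has already flowed out by cut-through, and the class and attribute names (`VcInputFifo`, `prev_tail`, `credits_to_send`) are hypothetical.

```python
class VcInputFifo:
    """Simplified model of one VC input FIFO (illustrative only)."""

    def __init__(self, max_pkt_size):
        self.max_pkt_size = max_pkt_size
        self.phits = []           # phits currently held in the FIFO
        self.prev_tail = 0        # FIFO length at the tail of the previous packet
        self.credits_to_send = 0  # credits owed to the sender for this VC

    def receive_packet(self, packet, marked_bad=False, flowed_out=False):
        """Enqueue one packet (a list of phits); return True if it was kept."""
        count = 0
        forced_bad = False
        for phit in packet:
            count += 1                      # a) count phits while enqueueing
            if count > self.max_pkt_size:   # b) clip a packet that would overflow
                forced_bad = True
                count = self.max_pkt_size
                break
            self.phits.append(phit)
        if marked_bad or forced_bad:        # c) bad-packet handling
            if flowed_out:
                # The phits already left the queue, so only the
                # credits-to-send counter is corrected.
                self.credits_to_send -= count
            # Either way nothing of the bad packet remains: the tail
            # pointer returns to the tail of the previous packet.
            del self.phits[self.prev_tail:]
            return False
        self.prev_tail = len(self.phits)
        return True
```

In this model the sender never learns about the bad packet; the correction is confined to the FIFO state and the credit counter.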

Routing

In one embodiment, routing in computing systems 100 and 120 is performed on variable length packets. The first phit of a packet is the header, which contains all the mandatory routing fields, and the last phit of a packet is an end of packet (EOP) phit which contains the packet checksum.

In a folded-Clos topology, packet routing is performed in two stages: routing up to a common ancestor of the source and destination processors, and then routing down to the destination processor. Up routing can use either adaptive or deterministic routing. Down routing, however, is always deterministic, as there is only a single path down the tree from any router to a destination processor.

Some systems 100 and 120 have a memory consistency model that requires that requests to the same address maintain ordering in the network. In such systems, request packets should use deterministic routing. Response packets do not require ordering, and so can be routed adaptively.

Packet routing is algorithmic and distributed. At each hop in the network, routing logic at the head of the input queue calculates the output port for the local router. This is performed using routing registers and an eight-entry routing table 220. The routing logic of route selector 218 is replicated in each tile 200, allowing multiple virtual routers per physical router and providing the needed bandwidth for parallel routing in all 64 tiles 200.

In the embodiments shown in FIGS. 3(a) and 3(b), there are three types of links (i.e., routes): uplinks, sidelinks and downlinks. Uplinks go from the injection port to a rank 1 router or from a rank n router to a rank n+1 router. Sidelinks go from a rank n router to a peer rank n router (only for R1.5, R2.5 and R3.5 networks). Downlinks go from a rank n router to a rank n−1 router or from a rank 1 router to the destination processor.

En route from the source to the common ancestor, the packet will take either an uplink 110 or a sidelink 108 depending on the class of the network (e.g., rank 2 or rank 2.5, respectively). Upon arrival at the common ancestor, the router begins routing the packet down the fat tree toward its final destination using the downlinks.

In one embodiment, the down route is accomplished by extracting a logical port number directly from the destination processor number. In one such embodiment, each router 106 in computer systems 100 and 120 has 64 ports which have both a physical number and an arbitrary logical number. System software performs network discovery when the system is initialized and assigns a logical port number to each physical port number.

Up and down routing will be discussed next. In one embodiment, each tile 200 has a root detect configuration register that identifies the subtree rooted at this router 106, using a 15-bit router location and a 15-bit mask. As an example, the root detect register of a rank-1 router connected to destinations 96-127 would have a router location of 0x0060 (96), and a mask of 0x001F (covering 32 destinations).

If the unmasked bits of the packet destination and the router location match, then the destination processor is contained within the router's subtree, and the packet can begin traversing downward. Otherwise the packet must continue to route up (or over if sidelinks are used).
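The root detect comparison can be sketched as a single masked equality test. The sketch assumes that a set mask bit marks a "don't care" position, which is consistent with the example mask of 0x001F covering 32 destinations; the function name `in_subtree` is hypothetical.

```python
def in_subtree(dest, location, mask):
    """True if dest and location agree on every bit not covered by mask.

    Assumption of this sketch: mask bits set to 1 are ignored in the
    comparison (the "masked" bits); all other bits must match.
    """
    return (dest & ~mask) == (location & ~mask)
```

With the rank-1 example above (location 0x0060, mask 0x001F), destinations 96 through 127 match and the packet may start routing down; any other destination keeps routing up or over.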

In one embodiment, routing up or over is accomplished using an eight-entry table 220, where each entry contains a location and mask bits (like the root detect register) identifying a subtree of the network. The packet destination is associatively checked against the routing table entries. The packet matches an entry if its destination is contained within the subtree identified by that entry. The matching entry then provides one or more uplinks/sidelinks that the packet may use to reach its destination. In deterministic routing, the routing logic produces a deterministic exit port for each packet.

In a healthy network, only a single entry is required for up routing, matching the entire network, and identifying the full set of available uplinks. In a system with faults, additional routing table entries are used to provide alternative uplinks for affected regions of the machine. If multiple entries match, then the entry with the highest index is chosen. Thus, entry 0 could be set to match the entire network, with a full uplink mask, and entry 1 could be set to match the subtree rooted at the fault, using a constrained uplink mask that avoids sending packets to a router that would encounter the fault en route to any destination processors in that subtree.
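The highest-index-wins lookup can be sketched as below. The table layout (a list of `(location, mask, uplink_mask)` tuples with `None` for unused entries) and the name `lookup_uplinks` are assumptions made for illustration; the hardware performs the match associatively rather than by iteration.

```python
def lookup_uplinks(table, dest):
    """Return (index, uplink_mask) of the highest-index matching entry.

    table: list of up to eight entries, each either None (unused) or a
           (location, mask, uplink_mask) tuple; mask bits set to 1 are
           treated as "don't care" (an assumption of this sketch).
    Returns None if no entry matches.
    """
    for index in range(len(table) - 1, -1, -1):
        entry = table[index]
        if entry is None:
            continue
        location, mask, uplink_mask = entry
        if (dest & ~mask) == (location & ~mask):
            return index, uplink_mask
    return None
```

In the usage below, entry 0 matches the whole 15-bit destination space with a full uplink mask, while entry 1 covers the subtree of destinations 96-127 with one uplink cleared.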

A given network fault casts a shadow over some subtree of endpoints that can be reached going down from the fault. We only need fault entries in the routing table for faults that do not cast a shadow over the local router. A router can also ignore a fault if it cannot be reached from this router (such as faults in another network slice).

In a router with configured sidelinks 108, each peer subtree is given its own routing table entry, which defines the set of sidelinks 108 usable to route to that subtree. No additional routing entries are required for faults.

In one embodiment, packets in the network adaptively route on a per-packet basis. In one embodiment, each packet header (FIG. 6) has an adapt (a) bit 300 that chooses the routing policy. If a=1 then the packet will choose the output port adaptively during up or side routing. When routing adaptively, routing table 220 of the input port 190 produces a 64-bit mask of allowable output ports 192. In one embodiment, the column mask is formed by OR-ing together the eligible ports within each column; the resultant 8-bit mask will have bit i set if any of the eight output ports of column i are set in the output port mask produced by the routing table. After constructing the set of allowable columns, we choose the winner (the eventual output column) based on the amount of space available in the row buffer for each column. Ties are broken fairly using a matrix arbiter.
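The column-mask reduction can be sketched as follows. The mapping of port numbers to rows and columns is not given in the text; this sketch assumes port = 8 × row + column. The tie-break here simply takes the lowest-numbered column, a simplification standing in for the matrix arbiter. Function names are hypothetical.

```python
def column_mask(port_mask, rows=8, cols=8):
    """OR together the eligible ports of each column into an 8-bit mask.

    Assumes (for illustration) that output port number = row * cols + col.
    """
    mask = 0
    for col in range(cols):
        for row in range(rows):
            if (port_mask >> (row * cols + col)) & 1:
                mask |= 1 << col
                break
    return mask


def choose_column(col_mask, row_buffer_space):
    """Pick the allowed column with the most free row-buffer space.

    Ties go to the lowest column index, a simplification of the fair
    matrix arbiter described in the text.
    """
    candidates = [c for c in range(8) if (col_mask >> c) & 1]
    return max(candidates, key=lambda c: row_buffer_space[c])
```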

When the packet is sent across the row bus to the chosen column it is accompanied by an 8-bit mask corresponding to the allowable output rows within that column. This row mask is used by the 8×8 subswitch 202 to select an exit row. The row selection at the subswitch is guided by the space available in the column buffers at the outputs: the row with the most space available in its column buffer is chosen.

Packets that are not marked as adaptive (a=0) are routed deterministically based on the output of a hash function. To uniformly spread the packets across the available uplinks, the hash function does an XOR of the input port, destination processor, and optional hash bits if the hash bit (h) is set in the packet header. The hash value is then mapped onto the set of output links identified by the routing table. The input port and destination processor are included in the hash to avoid non-uniformities in many-to-one traffic patterns. For request packets, the hash bit is set, and a portion of the packet's address is included in the hash function to further spread the traffic across the uplinks. In this way, we can load balance and still guarantee in-order delivery of packets from source to destination targeting a given address.
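A minimal sketch of the deterministic hash follows. The text does not say how the hash value is mapped onto the allowed links; the modulo step here is purely an assumption, as are the function and parameter names.

```python
def deterministic_port(input_port, dest, allowed_ports, hash_bits=0, h=False):
    """Pick an exit port deterministically from the allowed set.

    XORs the input port and destination processor, plus the optional hash
    bits when the header's h bit is set. Mapping the result onto the
    allowed links by modulo is an assumption of this sketch.
    """
    value = input_port ^ dest
    if h:
        value ^= hash_bits
    return allowed_ports[value % len(allowed_ports)]
```

Because the result depends only on fields that are identical for every request to a given address from a given source, such packets always take the same uplink, which is what preserves their ordering.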

Once the packet reaches a common ancestor it will begin routing down the subtree. The first step in routing down is to select a logical downlink number. The down route configuration register contains shift (s) and mask (m) values that are used by first right-shifting the destination processor number by s bits and then masking the bottom m bits to produce the logical output port number for the downlink. A rank 1 router, for example, would have s=0 and m=00011111. The logical port number is converted to a physical port number by a 64-entry port mapping table. The packet proceeds down the tree, shifting and masking the bits of the destination processor number to determine the downlink at each level, until it reaches the final egress port where it is sent to the processor's network interface.
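The shift-and-mask step can be illustrated in a few lines. The rank 1 values (s=0, m=00011111) come from the text; the rank 2 values used in the test and the contents of the port mapping table are hypothetical.

```python
def logical_downlink(dest, s, m):
    """Right-shift the destination by s bits, then apply the bit mask m."""
    return (dest >> s) & m


def physical_downlink(dest, s, m, port_map):
    """Translate the logical port through a port mapping table.

    port_map stands in for the 64-entry table; its contents here are
    whatever system software would have assigned (hypothetical values).
    """
    return port_map[logical_downlink(dest, s, m)]
```

For destination 100 a rank 1 router selects logical downlink 4 (100 & 0b11111). A router one level up, if configured with s=5, would select logical downlink 3.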

The Tile

In one embodiment, each tile 200 is broken into four blocks: the link control block (LCB), input buffers, 8×8 subswitch, and column buffers. The input buffer block contains 122 k cells (46% registers, 35% logic, and 19% SRAM), which includes the routing table and routing logic. A considerable amount of this logic is dedicated to handling speculative data forwarding (the LCB passing data up from the data-link layer prior to verifying the CRC) to handle error cases due to transmission errors and soft errors. The 8×8 subswitch accounts for 141 k cells (54% registers, 25% logic, and 21% SRAM), or approximately ⅓ of the logic in the tile. The subswitch contains the row buffers and logic that performs the 8-to-1 arbitration among the row buffers, and a 2-to-1 arbitration amongst the virtual channels. The column buffer block, which also performs the same two-stage arbitration as the subswitch, only accounts for 62 k cells (71% registers and 29% logic). The column buffers are implemented in latches, not SRAMs, so the bulk of the area in the column buffers is dedicated to latches. The remaining 111 k cells, or 25% of the tile area, is consumed by the LCB.

Selecting the Radix

The radix at which a network has minimum latency is largely determined by the aspect ratio of the network router. As noted by Kim, aspect ratio is given by:

A = (B t_r log N) / L

where B is the total bandwidth of a router, t_r is the per router delay, N is the size of the network, and L is the length of a packet. In an embodiment where the aspect ratio is 1600, the optimal radix would be 82.
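The text does not state how the optimal radix follows from the aspect ratio. As an assumption, the sketch below uses the relation k · ln²(k) = A from the high-radix router literature, with the natural logarithm chosen because it reproduces the quoted value. The function name is hypothetical.

```python
import math


def optimal_radix(aspect_ratio):
    """Solve k * ln(k)**2 = aspect_ratio for k by bisection.

    The relation itself is an assumption of this sketch; it is not given
    in the text above. The left-hand side increases with k for k > 1, so
    bisection on [2, 1e6] converges.
    """
    lo, hi = 2.0, 1e6
    for _ in range(100):
        mid = (lo + hi) / 2
        if mid * math.log(mid) ** 2 < aspect_ratio:
            lo = mid
        else:
            hi = mid
    return lo
```

Under that assumption an aspect ratio of 1600 gives a radix that rounds to 82, in agreement with the figure quoted above.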

While the optimal radix is 82, this is not a practical value. To simplify implementation and routing, the radix should be a power of 2. A radix that is not a power of 2 would require an integer division and modulo operation to determine the output port from a destination address. In one design approach, we consider radices of 64 and 128. Both of these values give network latency within 2% of the optimal value. Although the higher radix of 128 theoretically leads to lower cost, this theory assumes that port widths can be varied continuously. In one embodiment, we selected a radix of 64 because it gives better performance with our pinout and integral port-width constraints.

In one radix-64 embodiment, area constraints limited us to no more than 200 SerDes on the router chip. A radix-64 router using 3-bit wide ports requires 192 SerDes, fitting nicely within this constraint. A radix-128 router, on the other hand, is limited to 1-bit wide ports requiring 128 SerDes. Such a router has only ⅔ the bandwidth of the radix-64 router, resulting in significantly lower performance.
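The SerDes arithmetic above can be checked directly. This is only a worked calculation; the names `serdes_count` and `SERDES_LIMIT` are hypothetical.

```python
from fractions import Fraction

SERDES_LIMIT = 200  # area-imposed ceiling quoted in the text


def serdes_count(radix, lanes_per_port):
    """One SerDes per lane per port."""
    return radix * lanes_per_port


radix64 = serdes_count(64, 3)    # 192, within the limit
radix128 = serdes_count(128, 1)  # 128; two lanes per port would need 256
bandwidth_ratio = Fraction(radix128, radix64)  # 2/3 at equal lane rates
```

A radix-128 router with 2-bit wide ports would need 256 SerDes, which exceeds the limit and is why it is restricted to 1-bit wide ports.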

Some computer systems have cabinet-to-cabinet spacing that requires network links longer than six meters, the maximum length that can be driven reliably at the full signaling rate (6.25 Gb/s) of one embodiment of router 106. Such long links can be realized using optical signaling or using electrical cables with in-line repeaters. However, both of these alternatives carry a significant cost premium. If router 106 supports variable signaling rates (as described for SerDes 702 above) and flexible routing, these long links can be realized using electrical signaling over passive cables by using a reverse taper. By reducing the signaling rate on the link, significantly longer electrical cables can be driven. The reduced signaling rate can be offset by doubling the number of links provisioned at that level of the network (a reverse taper) to preserve network bandwidth.

We chose a high-radix folded-Clos topology for computer systems 100 and 120 because it offered both lower latency and lower cost than alternatives such as a torus network while still providing 8.33 GB/s of global memory bandwidth. We performed a zero-load latency comparison of the two different topologies. For the high-radix Clos network, radix-64 routers were used. For the 3-D torus, the configurations used were similar to those of the Cray XT3. Uniform random traffic was assumed in calculating the average hop count of the network.

For a small size network, there is a 2× reduction in latency when going from a 3-D torus to a high-radix Clos network. As the size of the network increases, however, there is over a 4× reduction in latency. With the lower hop count, the high-radix Clos not only reduces latency but also reduces cost. This is because network cost is approximately proportional to the total router bandwidth and, with the network bisection held constant, it is proportional to the hop count. Thus, high-radix Clos networks lead to a lower latency and a lower cost network.

There are also several qualitative attributes of the high-radix folded-Clos network which made it an attractive choice. Routing in a torus is more complex, as turn rules or virtual channels are needed to prevent deadlocks. In addition, complex routing algorithms are needed to properly load balance across adversarial traffic patterns.

In contrast, the folded-Clos has a very straightforward routing algorithm. Because of the path diversity in the topology, load balancing is achieved by selecting any one of the common ancestors. The folded Clos is also cycle-free by design so no additional virtual channels are needed to break deadlock. VC allocation is often the critical path in the router implementation and with fewer VCs, the VC allocation is also simplified.

Partitioning the Router

The radix-64 router 106 can be divided into multiple virtual routers with lower degree. For instance, a single router 106 can serve as two radix-32, four radix-16, or ten radix-6 virtual routers 106. Since each tile 200 has its own set of routing tables 220 and keeps track of the set of allowable exit ports, system software can partition the router into multiple virtual routers by programming the routing tables 220 associated with each virtual router with a set of masks that restricts output traffic to the ports 192 of that virtual router. This flexibility enables a router such as router 106 to be used in systems where packaging constraints require multiple lower radix routers.

Virtual routers can also be used to support multiple network slices in a single YARC chip 201. For example, a single YARC chip 201 can be configured as two radix-32 routers to provide a radix-32 first stage switch for two of the four BW network slices as shown in FIG. 3(c).

In one embodiment, router 106 employs virtual cut-through flow control externally but uses wormhole flow control internally due to buffer size constraints. In one such embodiment, the 64 input buffers 210 are each sized deep enough (256 phits) to account for a round-trip credit latency plus the length of a maximum-length packet (19 phits). This enables us to perform virtual cut-through flow control (with packet granularity) on external links.

It may not be feasible, however, to size the 512 row buffers or 512 column buffers large enough to account for credit latency plus maximum packet size. Thus wormhole flow control (at flit=phit granularity) is performed over both the row buses and the column channels to manage these buffers. In one embodiment, the row buffers 212 are 16 phits deep and the column buffers 214 are 10 phits deep, large enough to cover the credit latency over the global column lines. Here a maximum-length packet can block traffic from the same input row to other outputs in the same column (by leaving its tail in the row buffer).

In a hierarchical high-radix router 106, a radix-k router is composed of a number of p×p subswitches 202. The number needed is (k/p)². The cost and performance of the router depend on p. As p is reduced, the design approaches that of a fully buffered crossbar and becomes prohibitively expensive but provides higher performance. As p is increased, the design approaches an input-buffered crossbar and is inexpensive but has poor performance.

To stress the hierarchical organization, we applied worst-case traffic to router 106 in which all of the offered traffic “turns the corner” at a single subswitch 202. With this approach, with an offered load of λ, one subswitch 202 in each row sees λp packets per cycle while the other subswitches in the row are idle. In contrast, uniform random (UR) traffic does not stress the hierarchical organization because it evenly distributes traffic across the k/p subswitches 202 in a row with each subswitch 202 seeing only λp/k packets per cycle.

We wrote a simulator to evaluate the performance on worst-case traffic for subswitches with degree p of 2, 4, 8, 16, and 32. Subswitches 202 where p is 8, 16, or 32 perform almost identically with a throughput of about 60%. Since a p×p subswitch 202 provides an internal speedup of k/p (8, 4 and 2 respectively for p=8, 16 and 32), a sustained throughput of 60% provides more than sufficient performance for uniform traffic. With an 8×8 subswitch 202, we can sustain approximately five times the average traffic demand through our subswitch on uniform traffic, providing plenty of headroom for non-uniform traffic patterns.

Although 8, 16, or 32 input subswitches 202 provide nearly identical performance, higher degree subswitches give lower cost because the buffering required is O(k²/p). However, in one embodiment, we chose the more expensive p=8 configuration for two reasons. First, the higher-degree subswitches required too much time to perform the p-to-1 switch arbitration, which is a timing critical path in the implementation. Early results showed that an 8-to-1 arbitration can be done within a single 800 MHz clock cycle. A 16- or 32-to-1 arbitration would require a longer clock cycle or a pipelined arbiter. Second, a subswitch of size p=8 resulted in a modular design in which the number of ports was equal to the number of subswitches. This enabled us to build a tile that contained a single subswitch, a single input, and a single output.

A larger subswitch size would require each tile to have multiple inputs/outputs, while a smaller subswitch size would require several subswitches to share an input/output, complicating the design effort of the tiles.
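The trade-off between subswitch sizes can be tabulated from the expressions given above: (k/p)² subswitches, an internal speedup of k/p, and buffering that grows as k²/p. The sketch below is a worked calculation only; the `buffering` figure is the order-of-growth term and not an exact storage count, and the function name is hypothetical.

```python
def subswitch_tradeoffs(k, p):
    """Evaluate the scaling expressions for a radix-k router built from
    p x p subswitches (p is assumed to divide k)."""
    return {
        "subswitches": (k // p) ** 2,  # number of p x p subswitches
        "speedup": k // p,             # internal speedup of each subswitch
        "buffering": k * k // p,       # O(k^2/p) growth term only
    }
```

For k=64 only p=8 makes the number of subswitches equal to the number of ports, which is the modularity argument for one subswitch, one input and one output per tile.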

Fault Tolerance

The high path diversity of a high-radix folded-Clos network can be exploited to provide a degree of fault tolerance. The YARC chip 201 is designed to construct a network that provides graceful degradation in the presence of the following faults: a failed network cable or connector; a faulty router (i.e., a router 106 that stops responding); and a noisy high-speed serial lane that is causing excessive retries.

In a fault-free network, only a single entry in the routing table 220 is necessary to specify the uplinks for the entire system. However, higher-priority table entries can be used to override this master entry to restrict routing to a set of destinations. If a fault occurs at a particular node of the network, the routing tables can be set so that traffic with destinations in the subtree beneath the fault does not route to the fault or any ancestors of the fault. This is done by creating an entry that matches this set of destinations and that has an uplink mask with the bits corresponding to the faulty node and/or its ancestors cleared.

In one embodiment, the sender side of each port maintains a forward progress countdown timer for each virtual channel. If the forward progress timer expires, it indicates that a packet has not flowed in a long time and the router must prevent the error from propagating throughout the network. A forward progress timeout may happen if the attached processor stops accepting requests, causing the network to back pressure into the routers 106. Upon detection of a forward progress timeout, an interrupt is raised to the maintenance controller to inform the system software that a node has stopped responding. The router will begin discarding packets that are destined to the port which incurred the timeout.

In one embodiment, a link control block (LCB) handles the data-link layer of the communication stack. It provides reliable packet delivery across each network link using a sliding window go-back-N protocol. It manages the interface between the higher-level core logic and the lower-level SerDes interface (physical layer). The LCB counts the number of retries on a per-lane basis as a figure of merit for that serial channel. System software defines a threshold for the number of tolerable retries for any of the serial lanes within the 3-lane port.

In one embodiment, if the LCB detects that the retry count has exceeded the threshold, it will automatically decommission the noisy lane and operate in a degraded (2-bit wide or 1-bit wide) mode until the cable can be checked and possibly replaced. This allows the application to make forward progress in the presence of persistent retries on a given network link.

If all the lanes in the link are inoperable and must be disabled, the LCB will deassert the link active signal to the higher-level logic, which will cause a system-level interrupt to the maintenance controller and cause the sending port to discard any packets destined to the dead port. This prevents the single link failure from cascading into the rest of the network.

A folded-Clos topology is cycle free and under normal operating conditions is deadlock-free. In one embodiment, router 106 is designed to ensure the following invariant: once a packet begins traversing downward, it remains going downward until it reaches the destination. That is, packets that arrived from an uplink must route to a downlink. This prevents packets from being caught in a cycle containing uplinks and downlinks. If the router is configured properly, such a cycle should never occur. However, software and the programmers who create it are fallible. This dynamic invariant should help reduce the debugging time when investigating early-production routing software.

Non-uniform traffic can cause local hot spots that significantly increase contention in interconnection networks. To reduce this network load imbalance, in one embodiment router 106 performs two types of load balancing: hashing of deterministic routes to split bulk transfers up over multiple paths; and adaptive routing.

A system and method for enhancing diversity in routing is described in “LOAD BALANCING FOR COMMUNICATIONS WITHIN A MULTIPROCESSOR COMPUTER SYSTEM,” U.S. patent application Ser. No. xx/yyy,yyyy, filed herewith, the description of which is incorporated herein by reference.

Adaptive Routing

Implementing an adaptive routing scheme in a high-radix router is particularly challenging because of the large number of ports involved in the adaptive decision. Ideally, we would look at the congestion at all possible output ports (at most 32) and choose the queue with the most free space. Unfortunately, this is unrealistic in a 1.25 ns clock cycle. Instead, in keeping with the hierarchical organization of the router, we break the adaptive decision into two stages: choosing the output column, and choosing the output row within that column.

We first choose the column, c, by comparing the congestion of the row buffers for each of the columns identified by bits in the column mask. A full eight-way, four-bit comparison of row buffer depths was too expensive. Instead we look only at the most significant bit (MSB) of the row buffer depth, giving priority to buffers that are less than half full. We then select the column based on a round-robin arbitration, and route to the row buffers of the tile 201. This algorithm ignores the number of eligible output ports in each of the target columns, giving no preference to a column with more eligible outputs. However, columns with more eligible outputs will tend to drain faster, leading to more space in their subswitch row buffers.
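The first stage of the adaptive decision can be sketched as follows. It assumes the row buffer depth is reported as a four-bit value whose MSB set means "half full or more", and it models round-robin arbitration as a rotating start pointer. The function and parameter names are hypothetical.

```python
def choose_column_adaptive(col_mask, row_buffer_depth, rr_pointer, depth_bits=4):
    """Choose an output column using only the MSB of each row buffer depth.

    col_mask:         8-bit mask of allowable columns.
    row_buffer_depth: occupancy per column, as a depth_bits-wide value.
    rr_pointer:       column at which the round-robin search starts
                      (a simple stand-in for the hardware arbiter).
    Columns whose depth MSB is clear (less than half full) get priority.
    Returns None if no column is allowed.
    """
    eligible = [c for c in range(8) if (col_mask >> c) & 1]
    if not eligible:
        return None
    msb = 1 << (depth_bits - 1)
    preferred = [c for c in eligible if not row_buffer_depth[c] & msb]
    pool = preferred or eligible
    for offset in range(8):
        c = (rr_pointer + offset) % 8
        if c in pool:
            return c
```

Note that the number of eligible ports per column plays no part in the choice, matching the description above.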

In the second stage of the adaptive route, we choose the output row based on the bits of the row mask which are set. The row mask identifies the set of valid output ports within the chosen column. We again must rely on imperfect information to choose the output tile based on the depth of the column buffers in the r rows, where r is the number of bits set in the row mask. We choose among the rows by comparing two bits of the 4-bit column buffer depth (which is at most 10). The most significant bit indicates if the column buffer is “almost full” (i.e. 8 or more phits in the buffer), and the upper two bits together indicate if the column buffer has more than 4 phits but less than 8 phits, corresponding to “half full.” Finally, if the upper two bits of the buffer size are zero, then the column buffer is “almost empty.” The adaptive decision will choose the column buffer based on its current state, giving preference to those ports which are “almost empty,” then those that are “half full,” and finally those buffers that are “almost full.”
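The three-way classification from the upper two bits of the column buffer depth can be sketched as below. In this sketch a depth of exactly 4 falls into the "half full" class because it shares the upper bits 01 with depths 5 through 7; ties between rows of the same class go to the lowest row index, which is an assumption. Names are hypothetical.

```python
ALMOST_EMPTY, HALF_FULL, ALMOST_FULL = 0, 1, 2


def buffer_class(depth):
    """Classify a 4-bit column buffer depth from its upper two bits."""
    top2 = (depth >> 2) & 0b11
    if top2 & 0b10:
        return ALMOST_FULL   # MSB set: 8 or more phits
    if top2 == 0b01:
        return HALF_FULL     # upper bits 01: depths 4 through 7
    return ALMOST_EMPTY      # upper bits 00: fewer than 4 phits


def choose_row(row_mask, column_buffer_depth):
    """Pick the allowed row whose column buffer is in the emptiest class."""
    rows = [r for r in range(8) if (row_mask >> r) & 1]
    return min(rows, key=lambda r: buffer_class(column_buffer_depth[r]))
```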

A system and method for flexible routing, including adaptive routing, is described in “FLEXIBLE ROUTING TABLES FOR A HIGH-RADIX ROUTER,” U.S. patent application Ser. No. xx/yyy,yyyy, filed herewith, the description of which is incorporated herein by reference.

Router 106 is a high-radix router used in the network of computer systems 100 and 120. Computer systems 100 and 120 that use routers 106 with sixty-four 3-bit wide ports scale up to 32K processors using a folded-Clos topology with a worst-case diameter of seven hops. In one embodiment, each router 106 has an aggregate bandwidth of 2.4 Tb/s and a 32K-processor BlackWidow system has a bisection bandwidth of 2.5 Pb/s.

Router 106 uses a hierarchical organization to overcome the quadratic scaling of conventional input-buffered routers. A two-level hierarchy is organized as an 8×8 array of tiles. This organization simplifies arbitration with a minimal loss in performance. The tiled organization also resulted in a modular design that could be implemented in a short period of time.

The architecture of router 106 is strongly influenced by the constraints of modern ASIC technology. For instance, router 106 takes advantage of abundant on-chip wiring to provide separate column buses from each subswitch to each output port, greatly simplifying output arbitration. To operate using limited on-chip buffering, router 106 uses wormhole flow control internally while using virtual cut-through flow control over external channels.

To reduce the cost and the latency of the network, computer systems 100 and 120 use a folded-Clos network which, in some cases, is modified by adding sidelinks 108 to connect peer subtrees and statically partition the global network bandwidth. Such networks are superior to torus networks in terms of fault tolerance and bandwidth spreading. In some embodiments, both adaptive and deterministic routing algorithms are implemented in the network to provide load-balancing across the network and still maintain ordering on memory requests.

Speculative data forwarding allows one to reduce packet latency through a network while still providing reliable link transmission in hardware using a CRC-based sliding window protocol. It also allows us to provide reliable transmission across links (requiring the use of CRCs), keep CRC overhead down (as opposed to including a CRC with every few bytes of data), and still avoid introducing a significant store-and-forward delay at each hop while waiting for the next CRC to ensure reliable transmission.
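The speculative forwarding decision summarized in the abstract can be sketched in software. This is an illustration under stated assumptions: CRC-32 from Python's `zlib` stands in for the link CRC, whose polynomial the text does not specify, "correcting the CRC" is modeled as recomputing it over the data as received, and the function names and return strings are hypothetical.

```python
import zlib


def lcb_tail(payload, received_crc):
    """Model the LCB once the tail of a speculatively forwarded packet arrives.

    Returns (crc_to_forward, bad). The forwarded CRC is recomputed over the
    payload as received, so a later hop sees a consistent CRC, and `bad`
    models the phit modified to reflect the error.
    """
    computed = zlib.crc32(payload)
    return computed, computed != received_crc


def router_action(bad, still_in_router):
    """Decide what the router logic does with the packet."""
    if not bad:
        return "forward"
    if still_in_router:
        return "discard"                  # drop it locally
    return "mark_for_next_router"         # next router will discard it
```

The phits are handed to the router logic before the check completes, so the check result can only take effect at the tail, which is why the two bad-packet outcomes exist.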

Although specific embodiments have been illustrated and described herein, it will be appreciated by those of ordinary skill in the art that any arrangement which is calculated to achieve the same purpose may be substituted for the specific embodiment shown. This application is intended to cover any adaptations or variations of the present invention. Therefore, it is intended that this invention be limited only by the claims and the equivalents thereof.

1. A router, comprising: a plurality of subswitches arranged in a n×p matrix, wherein each subswitch includes n inputs and p outputs, wherein both n and p are greater than one; a plurality of input ports; local control blocks connected to the plurality of input buffers, wherein the local control blocks include means for speculative forwarding of phits received by the router; a plurality of output ports, wherein each output port includes a multiplexer and an arbiter for selecting data to be switched onto the output port via the multiplexer; a plurality of row busses, wherein each row bus receives data from one of the plurality of input ports and distributes the data to two or more of the plurality of subswitches; and a plurality of column channels, wherein each column channel connects one of the outputs of one of the subswitches to an input of one of the multiplexers; wherein each row bus includes a route selector, wherein the route selector includes a routing table which selects an output port for each packet and which routes the packet through one of the row busses to the selected output port.
 2. The router of claim 1, wherein n equals p.
 3. The router of claim 1, wherein the speculative forwarding means includes means for discarding packets with incorrect cyclic redundancy codes.
 4. A router, comprising: a plurality of subswitches arranged in a n×p matrix, wherein each subswitch includes n inputs and p outputs, wherein both n and p are greater than one; a plurality of input ports; local control blocks connected to the plurality of input buffers, wherein the local control blocks include means for speculative forwarding of phits received by the router; a plurality of output ports, wherein each output port includes a multiplexer and an arbiter for selecting data to be switched onto the output port via the multiplexer; distribution means, connected to the input ports and the subswitches, for receiving data from one of the plurality of input ports and for distributing the data to two or more of the plurality of subswitches; and a plurality of column channels, wherein each column channel connects one of the outputs of one of the subswitches to an input of one of the multiplexers; wherein the distribution means includes a route selector, wherein the route selector includes a routing table which selects an output port for each packet and which routes the packet to the selected output port.
 5. The router of claim 4, wherein n equals p.
 6. The router of claim 4, wherein the speculative forwarding means includes means for discarding packets with incorrect cyclic redundancy codes.
 7. A computer system, comprising: a plurality of processor nodes; a plurality of first routers; and a plurality of second routers; wherein each first router is connected to a processor node and to two or more second routers and wherein each first router includes: a plurality of subswitches arranged in a n×p matrix, wherein each subswitch includes p inputs and p outputs; a plurality of input ports; local control blocks connected to the plurality of input buffers, wherein the local control blocks include means for speculative forwarding of phits received by the router; a plurality of output ports, wherein each output port includes a multiplexer and an arbiter for selecting data to be switched onto the output port via the multiplexer; a plurality of row busses, wherein each row bus receives data from one of the plurality of input ports and distributes the data to two or more of the plurality of subswitches; and a plurality of column channels, wherein each column channel connects one of the outputs of one of the subswitches to an input of one of the multiplexers; wherein each row bus includes a route selector, wherein the route selector includes a routing table which selects an output port for each packet and which routes the packet through one of the row busses to the selected output port.
 8. The computer system of claim 7, wherein n equals p.
 9. The computer system of claim 7, wherein the speculative forwarding means includes means for discarding packets with incorrect cyclic redundancy codes.
 10. A method of speculative forwarding of packets received by a router, wherein each packet includes phits and wherein one or more phits include a cyclic redundancy code (CRC), the method comprising: receiving a packet and forwarding phits of the packet to router logic; calculating a cyclic redundancy code; comparing the calculated cyclic redundancy code to the packet's cyclic redundancy code and generating an error if the cyclic redundancy codes don't match; if the cyclic redundancy codes don't match, modifying a phit of the packet to reflect the error, correcting the CRC and forwarding the corrected CRC with the phit reflecting the error to the router logic; at the router logic, determining if the packet is still within the router logic; if the packet is still within the router logic and there was a CRC error, discarding the packet; and if the packet is no longer within the router logic and there was a CRC error, modifying the packet so that the next router discards the packet.
 11. The method of claim 10, wherein comparing the calculated cyclic redundancy code to the packet's cyclic redundancy code includes determining if the packet has a length greater than the maximum packet length.